Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y).
For example, spam detection in email service providers can be framed as a classification problem. It is a binary classification, since there are only two classes: spam and not spam. A classifier uses training data to learn how the input variables relate to the class, so here known spam and non-spam emails serve as the training data. Once the classifier is trained accurately, it can be used to label a new, unseen email.
Classification belongs to the category of supervised learning, where the targets are provided along with the input data. Classification has applications in many domains, such as credit approval, medical diagnosis, and target marketing.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
drug_df = pd.read_csv("drug200.csv")
drug_df.head()
| | Age | Sex | BP | Cholesterol | Na_to_K | Drug |
|---|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | HIGH | 25.355 | DrugY |
| 1 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 2 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 4 | 61 | F | LOW | HIGH | 18.043 | DrugY |
drug_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Age          200 non-null    int64
 1   Sex          200 non-null    object
 2   BP           200 non-null    object
 3   Cholesterol  200 non-null    object
 4   Na_to_K      200 non-null    float64
 5   Drug         200 non-null    object
dtypes: float64(1), int64(1), object(4)
memory usage: 9.5+ KB
We can see that there are no missing/null values in the data set.
drug_df.Sex.value_counts()
M    104
F     96
Name: Sex, dtype: int64
drug_df.BP.value_counts()
HIGH      77
LOW       64
NORMAL    59
Name: BP, dtype: int64
drug_df.Cholesterol.value_counts()
HIGH      103
NORMAL     97
Name: Cholesterol, dtype: int64
drug_df.describe()
| | Age | Na_to_K |
|---|---|---|
| count | 200.000000 | 200.000000 |
| mean | 44.315000 | 16.084485 |
| std | 16.544315 | 7.223956 |
| min | 15.000000 | 6.269000 |
| 25% | 31.000000 | 10.445500 |
| 50% | 45.000000 | 13.936500 |
| 75% | 58.000000 | 19.380000 |
| max | 74.000000 | 38.247000 |
print('Age skewness: ', drug_df.Age.skew(axis=0, skipna=True))
Age skewness: 0.03030835703000607
print('Na_to_K skewness: ', drug_df.Na_to_K.skew(axis=0, skipna=True))
Na_to_K skewness: 1.039341186028881
sns.displot(drug_df['Age'], kde=True)
<seaborn.axisgrid.FacetGrid at 0x7fde319d9d60>
sns.displot(drug_df['Na_to_K'], kde=True)
<seaborn.axisgrid.FacetGrid at 0x7fde34226c40>
fig = px.histogram(drug_df, x='Drug',color='Sex', height=500, width=900)
fig.update_layout(
template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
title='Distribution of Drug Types by Gender')
fig
drugX is taken by roughly equal numbers of males and females. DrugY is mostly taken by females, while drugA, drugB, and drugC are mostly taken by male patients.
fig = px.histogram(drug_df, x='Age',color='Cholesterol', height=500, width=900)
fig.update_layout(
template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
title='Distribution of Age by Cholesterol Level')
fig
More patients in the 35-39 age group have normal cholesterol levels than any other age group.
More patients in the 55-59 age group have high cholesterol levels than any other age group.
fig = px.bar(drug_df, x='Age',y='Na_to_K', height=500, width=900)
fig.update_layout(
template="seaborn",barmode='stack', xaxis={'categoryorder':'total descending'},
title='NA/K Levels by Age')
fig
fig = px.histogram(drug_df, x='Age',color='BP', height=500, width=900)
fig.update_layout(
template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
title='Distribution of Blood Pressure Levels by Age')
fig
Most patients in the 45-49 age group have low blood pressure, while most patients in the 20-29 age group have either high or normal blood pressure.
fig = px.histogram(drug_df, x='Age', color='Sex', height=500, width=900)
fig.update_layout(
template='seaborn', barmode='group', xaxis={'categoryorder':'total descending'},
title='Distribution of Age by Gender')
Most males are in the 45-49 age group, and most females are in the 35-39 and 55-59 age groups.
fig = px.scatter(drug_df, x='Drug',y='Age', height=500, width=600)
fig.update_layout(
template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
title="Distribution of Patient Ages by Drug Type")
fig
All patients taking Drug B are over the age of 50, and all patients taking Drug A are under 51.
fig = px.histogram(drug_df, x='BP',color='Sex', height=500, width=600)
fig.update_layout(
template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
    title='Distribution of Blood Pressure Levels by Gender')
fig
Most patients have high blood pressure.
fig = px.histogram(drug_df, x='Sex',color='Cholesterol', height=500, width=600)
fig.update_layout(
template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
    title='Distribution of Gender by Cholesterol Level')
fig
More males than females have high cholesterol, and more males than females have normal cholesterol.
fig = px.scatter(drug_df, x='Sex',y='Na_to_K', height=500, width=600)
fig.update_layout(
template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
title='Distribution of Na/K Levels by Gender')
fig
fig = px.histogram(drug_df, x='BP',color='Cholesterol', height=500, width=600)
fig.update_layout(
template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
title='Distribution of Blood Pressure levels by Cholesterol Levels')
fig
drug_df.loc[(drug_df['BP']=='HIGH') & (drug_df['Cholesterol']=='HIGH'),
'Drug'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='High Blood Pressure and High Cholesterol')
<AxesSubplot:title={'center':'High Blood Pressure and High Cholesterol'}, ylabel='Drug'>
Most patients with high blood pressure and high cholesterol take Drug Y.
drug_df.loc[(drug_df['BP']=='HIGH') & (drug_df['Cholesterol']=='NORMAL'),
'Drug'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='High Blood Pressure and Normal Cholesterol')
<AxesSubplot:title={'center':'High Blood Pressure and Normal Cholesterol'}, ylabel='Drug'>
Most patients with high blood pressure and normal cholesterol take Drug Y.
drug_df.loc[(drug_df['BP']=='LOW') & (drug_df['Cholesterol']=='HIGH'),
'Drug'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='Low Blood Pressure and High Cholesterol')
<AxesSubplot:title={'center':'Low Blood Pressure and High Cholesterol'}, ylabel='Drug'>
drug_df.loc[(drug_df['BP']=='LOW') & (drug_df['Cholesterol']=='NORMAL'),
'Drug'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='Low Blood Pressure and Normal Cholesterol')
<AxesSubplot:title={'center':'Low Blood Pressure and Normal Cholesterol'}, ylabel='Drug'>
drug_df.loc[(drug_df['BP']=='NORMAL') & (drug_df['Cholesterol']=='HIGH'),
'Drug'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='Normal Blood Pressure and High Cholesterol')
<AxesSubplot:title={'center':'Normal Blood Pressure and High Cholesterol'}, ylabel='Drug'>
drug_df.loc[(drug_df['BP']=='NORMAL') & (drug_df['Cholesterol']=='NORMAL'),
'Drug'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='Normal Blood Pressure and Normal Cholesterol')
<AxesSubplot:title={'center':'Normal Blood Pressure and Normal Cholesterol'}, ylabel='Drug'>
fig = px.histogram(drug_df, x='Drug',color='BP', height=500, width=600)
fig.update_layout(
template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
title='Distribution of Drug Type by Blood Pressure Level')
fig
fig = px.scatter(drug_df, x='Cholesterol',y='Na_to_K', height=500, width=600)
fig.update_layout(
template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
title='Distribution of Na/K Ratios by Cholesterol Levels')
fig
fig = px.histogram(drug_df, x='Drug',color='Cholesterol', height=500, width=600)
fig.update_layout(
template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
title='Distribution of Drug Types by Cholesterol Levels')
fig
fig = px.scatter(drug_df, x='Drug',y='Na_to_K', height=500, width=600)
fig.update_layout(
template="seaborn",barmode='overlay', xaxis={'categoryorder':'total descending'},
title='Distribution of Drug Types by Na/K Ratio')
fig
drug_df.loc[(drug_df['Na_to_K']<15),'BP'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='Na to K under 15 effect on BP')
<AxesSubplot:title={'center':'Na to K under 15 effect on BP'}, ylabel='BP'>
drug_df.loc[(drug_df['Na_to_K']>15),'BP'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='Na to K over 15 effect on BP')
<AxesSubplot:title={'center':'Na to K over 15 effect on BP'}, ylabel='BP'>
bin_age = [0, 19, 29, 39, 49, 59, 69, 80]
category_age = ['<20s', '20s', '30s', '40s', '50s', '60s', '>60s']
drug_df['Age_binned'] = pd.cut(drug_df['Age'], bins=bin_age, labels=category_age)
drug_df = drug_df.drop(['Age'], axis = 1)
bin_NatoK = [0, 9, 19, 29, 50]
category_NatoK = ['<10', '10-20', '20-30', '>30']
drug_df['Na_to_K_binned'] = pd.cut(drug_df['Na_to_K'], bins=bin_NatoK, labels=category_NatoK)
drug_df = drug_df.drop(['Na_to_K'], axis = 1)
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
We need to separate the response variable from the predictor variables.
X = drug_df.drop(["Drug"], axis=1)
y = drug_df["Drug"]
Since the data set is small, we will hold out 33% of the samples for testing and train on the remaining 67%.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)
print('The Shape Of The Original Data: ', drug_df.shape)
print('The Shape Of The X Test Sample: ', x_test.shape)
print('The Shape Of The X Training Sample: ', x_train.shape)
print('The Shape Of The y Test Sample: ', y_test.shape)
print('The Shape Of The y Training Sample: ', y_train.shape)
The Shape Of The Original Data:  (200, 6)
The Shape Of The X Test Sample:  (66, 5)
The Shape Of The X Training Sample:  (134, 5)
The Shape Of The y Test Sample:  (66,)
The Shape Of The y Training Sample:  (134,)
This confirms that our test sample is 33% of the full data set.
drug_df['Drug'].value_counts()
DrugY    91
drugX    54
drugA    23
drugC    16
drugB    16
Name: Drug, dtype: int64
drug_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Sex             200 non-null    object
 1   BP              200 non-null    object
 2   Cholesterol     200 non-null    object
 3   Drug            200 non-null    object
 4   Age_binned      200 non-null    category
 5   Na_to_K_binned  200 non-null    category
dtypes: category(2), object(4)
memory usage: 7.3+ KB
x_train = pd.get_dummies(x_train)
x_test = pd.get_dummies(x_test)
x_train.head()
| | Sex_F | Sex_M | BP_HIGH | BP_LOW | BP_NORMAL | Cholesterol_HIGH | Cholesterol_NORMAL | Age_binned_<20s | Age_binned_20s | Age_binned_30s | Age_binned_40s | Age_binned_50s | Age_binned_60s | Age_binned_>60s | Na_to_K_binned_<10 | Na_to_K_binned_10-20 | Na_to_K_binned_20-30 | Na_to_K_binned_>30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 54 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 163 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 51 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 86 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 139 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
x_test.head()
| | Sex_F | Sex_M | BP_HIGH | BP_LOW | BP_NORMAL | Cholesterol_HIGH | Cholesterol_NORMAL | Age_binned_<20s | Age_binned_20s | Age_binned_30s | Age_binned_40s | Age_binned_50s | Age_binned_60s | Age_binned_>60s | Na_to_K_binned_<10 | Na_to_K_binned_10-20 | Na_to_K_binned_20-30 | Na_to_K_binned_>30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 170 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 107 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 98 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 177 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
print("X_train", x_train.shape)
print("X_test", x_test.shape)
print("y_train", y_train.shape)
print("y_test", y_test.shape)
X_train (134, 18)
X_test (66, 18)
y_train (134,)
y_test (66,)
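A caveat worth noting: calling `pd.get_dummies` on the training and test sets separately only works here because every category happens to appear in both splits. If a level were missing from the test set, the column sets would diverge. A minimal sketch of one way to guard against this with `DataFrame.align` (the toy `BP` values below are illustrative, not the dataset):

```python
import pandas as pd

# Toy example: 'NORMAL' appears in train but not in test, so
# separate get_dummies calls would produce different columns.
train = pd.DataFrame({"BP": ["HIGH", "LOW", "NORMAL"]})
test = pd.DataFrame({"BP": ["HIGH", "LOW"]})

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)

# Align the test columns to the training columns, filling the
# missing dummy column with zeros.
train_d, test_d = train_d.align(test_d, join="left", axis=1, fill_value=0)

print(list(train_d.columns) == list(test_d.columns))  # True
```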
SMOTE, or Synthetic Minority Oversampling Technique, is an oversampling technique, but it works differently from typical oversampling.
In classic oversampling, minority-class records are simply duplicated. While this increases the number of samples, it gives the model no new information or variation.
SMOTE instead uses a k-nearest-neighbours algorithm to create synthetic data: it picks a random sample from the minority class, finds its k nearest minority-class neighbours, and generates a synthetic point between the sample and a randomly selected neighbour.
https://towardsdatascience.com/5-smote-techniques-for-oversampling-your-imbalance-data-b8155bdbe2b5
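The interpolation step described above can be sketched in a few lines of NumPy (the two minority-class points are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two minority-class points in a 2-D feature space.
x = np.array([1.0, 2.0])         # randomly chosen minority sample
neighbor = np.array([3.0, 4.0])  # one of its k nearest minority neighbours

# SMOTE draws a synthetic point on the line segment between them.
gap = rng.random()               # uniform in [0, 1)
synthetic = x + gap * (neighbor - x)

# The synthetic sample lies between the original and its neighbour.
print(np.all((synthetic >= x) & (synthetic <= neighbor)))  # True
```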
#pip install imblearn
sns.set_theme(style="darkgrid")
sns.countplot(y=y_train, palette="mako_r")
plt.ylabel('Drug Type')
plt.xlabel('Total')
plt.show()
This shows us that the training set is not balanced.
from imblearn.over_sampling import SMOTE
x_train, y_train = SMOTE().fit_resample(x_train, y_train)
sns.set_theme(style="darkgrid")
sns.countplot(y=y_train, palette="mako_r")
plt.ylabel('Drug Type')
plt.xlabel('Total')
plt.show()
This shows that the training set is now balanced across the drug types.
This type of statistical model (also known as logit model) is often used for classification and predictive analytics. Logistic regression estimates the probability of an event occurring, such as voted or didn’t vote, based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1. In logistic regression, a logit transformation is applied on the odds—that is, the probability of success divided by the probability of failure. This is also commonly known as the log odds, or the natural logarithm of odds.
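As a small illustration of the logit transformation described above (the 0.8 probability is an arbitrary example value):

```python
import math

def logit(p):
    """Log odds: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit; maps log odds back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

p = 0.8
z = logit(p)                 # log odds of an 80% probability
print(round(z, 4))           # 1.3863
print(round(sigmoid(z), 4))  # 0.8, recovering the original probability
```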
from sklearn.linear_model import LogisticRegression
LRclassifier = LogisticRegression(solver='liblinear', max_iter=5000)
LRclassifier.fit(x_train, y_train)
y_pred = LRclassifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
from sklearn.metrics import accuracy_score
LRAcc = accuracy_score(y_test, y_pred)
print('Logistic Regression accuracy is: {:.2f}%'.format(LRAcc*100))
precision recall f1-score support
DrugY 1.00 0.74 0.85 34
drugA 0.71 1.00 0.83 5
drugB 0.75 1.00 0.86 3
drugC 0.67 1.00 0.80 4
drugX 0.83 1.00 0.91 20
accuracy 0.86 66
macro avg 0.79 0.95 0.85 66
weighted avg 0.90 0.86 0.86 66
[[25 2 1 2 4]
[ 0 5 0 0 0]
[ 0 0 3 0 0]
[ 0 0 0 4 0]
[ 0 0 0 0 20]]
Logistic Regression accuracy is: 86.36%
K-Nearest Neighbors is a lazy learning algorithm that stores all training instances as points in n-dimensional space. When an unseen data point arrives, it finds the k closest stored instances (the nearest neighbors) and returns the most common class among them as the prediction; for real-valued targets it returns the mean of the k nearest neighbors.
The distance-weighted variant weights the contribution of each of the k neighbors according to its distance from the query point, giving greater weight to the closest neighbors.
KNN is usually robust to noisy data, since the prediction averages over the k nearest neighbors.
https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
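A minimal sketch of the distance-weighted voting described above (the toy points below are illustrative, not the drug dataset):

```python
import numpy as np
from collections import Counter

def weighted_knn_predict(X_train, y_train, query, k=3):
    """Distance-weighted k-NN: each of the k nearest neighbors votes
    with weight 1/distance, so closer points count for more."""
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter()
    for i in nearest:
        votes[y_train[i]] += 1.0 / (dists[i] + 1e-9)  # avoid divide-by-zero
    return votes.most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0]])
y = np.array(["a", "a", "b"])
print(weighted_knn_predict(X, y, np.array([0.2, 0.2]), k=3))  # a
```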
from sklearn.neighbors import KNeighborsClassifier
KNclassifier = KNeighborsClassifier(n_neighbors=20)
KNclassifier.fit(x_train, y_train)
y_pred = KNclassifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
KNAcc = accuracy_score(y_test, y_pred)
print('K Neighbours accuracy is: {:.2f}%'.format(KNAcc*100))
precision recall f1-score support
DrugY 0.79 0.68 0.73 34
drugA 0.38 0.60 0.46 5
drugB 0.50 0.67 0.57 3
drugC 0.50 0.50 0.50 4
drugX 0.81 0.85 0.83 20
accuracy 0.71 66
macro avg 0.60 0.66 0.62 66
weighted avg 0.74 0.71 0.72 66
[[23 4 1 2 4]
[ 1 3 1 0 0]
[ 0 1 2 0 0]
[ 2 0 0 2 0]
[ 3 0 0 0 17]]
K Neighbours accuracy is: 71.21%
Support Vector Machine (SVM) can be used for both regression and classification tasks, but it is most widely used for classification. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space (N = the number of features) that distinctly separates the data points.
Many possible hyperplanes could separate two classes of data points. Our objective is to find the plane with the maximum margin, i.e. the maximum distance between the hyperplane and the data points of both classes. Maximizing the margin provides some reinforcement so that future data points can be classified with more confidence.
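For a linear SVM the margin width works out to 2/||w||, where w is the learned weight vector. A small sketch on a made-up, linearly separable toy problem (a large `C` approximates a hard margin; the points are illustrative, not the drug data):

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable binary problem.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin

# For a linear SVM, the margin width is 2 / ||w||.
w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)
print(round(margin, 2))
```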
from sklearn.svm import SVC
SVCclassifier = SVC(kernel='linear', max_iter=251)
SVCclassifier.fit(x_train, y_train)
y_pred = SVCclassifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
SVCAcc = accuracy_score(y_test, y_pred)
print('SVC accuracy is: {:.2f}%'.format(SVCAcc*100))
precision recall f1-score support
DrugY 0.92 0.71 0.80 34
drugA 0.75 0.60 0.67 5
drugB 0.75 1.00 0.86 3
drugC 0.50 1.00 0.67 4
drugX 0.83 1.00 0.91 20
accuracy 0.82 66
macro avg 0.75 0.86 0.78 66
weighted avg 0.85 0.82 0.82 66
[[24 1 1 4 4]
[ 2 3 0 0 0]
[ 0 0 3 0 0]
[ 0 0 0 4 0]
[ 0 0 0 0 20]]
SVC accuracy is: 81.82%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/svm/_base.py:301: ConvergenceWarning: Solver terminated early (max_iter=251). Consider pre-processing your data with StandardScaler or MinMaxScaler.
Naive Bayes is a probabilistic classifier inspired by Bayes' theorem, under the simple assumption that the attributes are conditionally independent.
Classification is conducted by deriving the maximum posterior, i.e. the maximal P(Ci|X), with the above assumption applied to Bayes' theorem. The assumption greatly reduces computational cost, since only the class-conditional distributions need to be counted. Even though the assumption rarely holds exactly (attributes are often dependent), Naive Bayes performs surprisingly well.
Naive Bayes is very simple to implement and obtains good results in most cases. It scales easily to larger datasets since it takes linear time, rather than the expensive iterative approximation used by many other types of classifiers.
Naive Bayes can suffer from the zero-probability problem: when the conditional probability of a particular attribute value is zero, the classifier fails to give a valid prediction. This must be fixed explicitly, for example with a Laplace estimator.
https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
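A minimal sketch of the Laplace (add-alpha) fix for the zero-probability problem; the observed values below are hypothetical, not counts taken from the dataset:

```python
def smoothed_prob(value, observed, categories, alpha=1.0):
    """P(attribute = value | class) with Laplace smoothing:
    (count + alpha) / (N + alpha * |categories|), so an unseen
    value never gets probability zero."""
    count = sum(1 for v in observed if v == value)
    return (count + alpha) / (len(observed) + alpha * len(categories))

# BP values among some hypothetical class: 'NORMAL' never occurs.
observed = ["HIGH", "HIGH", "LOW", "HIGH"]
cats = ["HIGH", "LOW", "NORMAL"]
print(smoothed_prob("NORMAL", observed, cats))  # (0+1)/(4+3) ~ 0.143, not 0
print(smoothed_prob("HIGH", observed, cats))    # (3+1)/(4+3) ~ 0.571
```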
from sklearn.naive_bayes import CategoricalNB
NBclassifier1 = CategoricalNB()
NBclassifier1.fit(x_train, y_train)
y_pred = NBclassifier1.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
NBAcc1 = accuracy_score(y_test, y_pred)
print('Naive Bayes accuracy is: {:.2f}%'.format(NBAcc1*100))
precision recall f1-score support
DrugY 0.89 0.71 0.79 34
drugA 0.62 1.00 0.77 5
drugB 0.75 1.00 0.86 3
drugC 0.50 0.50 0.50 4
drugX 0.74 0.85 0.79 20
accuracy 0.77 66
macro avg 0.70 0.81 0.74 66
weighted avg 0.79 0.77 0.77 66
[[24 3 1 2 4]
[ 0 5 0 0 0]
[ 0 0 3 0 0]
[ 0 0 0 2 2]
[ 3 0 0 0 17]]
Naive Bayes accuracy is: 77.27%
from sklearn.naive_bayes import GaussianNB
NBclassifier2 = GaussianNB()
NBclassifier2.fit(x_train, y_train)
y_pred = NBclassifier2.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
NBAcc2 = accuracy_score(y_test, y_pred)
print('Gaussian Naive Bayes accuracy is: {:.2f}%'.format(NBAcc2*100))
precision recall f1-score support
DrugY 0.67 0.91 0.78 34
drugA 0.71 1.00 0.83 5
drugB 0.75 1.00 0.86 3
drugC 1.00 0.50 0.67 4
drugX 1.00 0.35 0.52 20
accuracy 0.73 66
macro avg 0.83 0.75 0.73 66
weighted avg 0.80 0.73 0.70 66
[[31 2 1 0 0]
[ 0 5 0 0 0]
[ 0 0 3 0 0]
[ 2 0 0 2 0]
[13 0 0 0 7]]
Gaussian Naive Bayes accuracy is: 72.73%
A decision tree builds classification or regression models in the form of a tree structure. It uses an if-then rule set that is mutually exclusive and exhaustive for classification. Rules are learned sequentially from the training data, one at a time. Each time a rule is learned, the tuples covered by the rule are removed. The process continues on the training set until a termination condition is met.
The tree is constructed in a top-down, recursive, divide-and-conquer manner. All attributes should be categorical; otherwise, they should be discretized in advance. Attributes near the top of the tree have more impact on the classification and are identified using the information-gain concept.
A decision tree can easily overfit, generating too many branches that may reflect anomalies due to noise or outliers. An overfitted model performs poorly on unseen data even though it performs impressively on the training data. This can be avoided by pre-pruning, which halts tree construction early, or post-pruning, which removes branches from the fully grown tree.
https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
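The information-gain concept mentioned above can be sketched directly (the split below is a made-up, perfectly separating binary attribute, not one from the dataset):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, groups):
    """Entropy reduction from splitting `labels` into `groups`."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# A hypothetical attribute that separates the two drugs perfectly:
labels = ["drugX", "drugX", "DrugY", "DrugY"]
groups = [["drugX", "drugX"], ["DrugY", "DrugY"]]
print(information_gain(labels, groups))  # 1.0 (one full bit gained)
```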
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
DTclassifier = DecisionTreeClassifier(criterion='gini', max_depth=10, random_state=0)
DTclassifier = DTclassifier.fit(x_train, y_train)
y_pred = DTclassifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
DTAcc = accuracy_score(y_test, y_pred)
print('Decision Tree accuracy is: {:.2f}%'.format(DTAcc*100))
precision recall f1-score support
DrugY 0.92 0.65 0.76 34
drugA 0.50 0.80 0.62 5
drugB 0.75 1.00 0.86 3
drugC 0.50 1.00 0.67 4
drugX 0.86 0.95 0.90 20
accuracy 0.79 66
macro avg 0.71 0.88 0.76 66
weighted avg 0.84 0.79 0.79 66
[[22 4 1 4 3]
[ 1 4 0 0 0]
[ 0 0 3 0 0]
[ 0 0 0 4 0]
[ 1 0 0 0 19]]
Decision Tree accuracy is: 78.79%
#conda install graphviz
import graphviz
dot_data = tree.export_graphviz(DTclassifier, out_file=None,
                                feature_names=x_train.columns,
                                class_names=DTclassifier.classes_,  # the sorted class labels, not the full y_train series
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph
Random forest is a commonly-used machine learning algorithm, which combines the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems.
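For classification, "combines the output of multiple decision trees" means a majority vote across the trees, roughly like this (the per-tree predictions below are hypothetical):

```python
from collections import Counter

# Hypothetical per-tree predictions for one patient from a 5-tree forest.
tree_votes = ["DrugY", "drugX", "DrugY", "DrugY", "drugC"]

# The forest's prediction is the class receiving the most votes.
prediction = Counter(tree_votes).most_common(1)[0][0]
print(prediction)  # DrugY
```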
from sklearn.ensemble import RandomForestClassifier
RFclassifier = RandomForestClassifier(max_leaf_nodes=30)
RFclassifier.fit(x_train, y_train)
y_pred = RFclassifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
RFAcc = accuracy_score(y_test, y_pred)
print('Random Forest accuracy is: {:.2f}%'.format(RFAcc*100))
precision recall f1-score support
DrugY 1.00 0.59 0.74 34
drugA 0.50 1.00 0.67 5
drugB 0.75 1.00 0.86 3
drugC 0.50 1.00 0.67 4
drugX 0.83 1.00 0.91 20
accuracy 0.79 66
macro avg 0.72 0.92 0.77 66
weighted avg 0.87 0.79 0.79 66
[[20 5 1 4 4]
[ 0 5 0 0 0]
[ 0 0 3 0 0]
[ 0 0 0 4 0]
[ 0 0 0 0 20]]
Random Forest accuracy is: 78.79%
compare = pd.DataFrame({'Model': ['Logistic Regression', 'K Neighbors', 'SVM', 'Categorical NB', 'Gaussian NB', 'Decision Tree', 'Random Forest'],
'Accuracy': [LRAcc*100, KNAcc*100, SVCAcc*100, NBAcc1*100, NBAcc2*100, DTAcc*100, RFAcc*100]})
compare.sort_values(by='Accuracy', ascending=False)
| | Model | Accuracy |
|---|---|---|
| 0 | Logistic Regression | 86.363636 |
| 2 | SVM | 81.818182 |
| 5 | Decision Tree | 78.787879 |
| 6 | Random Forest | 78.787879 |
| 3 | Categorical NB | 77.272727 |
| 4 | Gaussian NB | 72.727273 |
| 1 | K Neighbors | 71.212121 |